feat(performance): query improvements for trends (load less people) #23135
Conversation
```sql
WHERE and(equals(person.team_id, 2), in(id,
  (SELECT source.actor_id AS actor_id
   FROM
     (SELECT actor_id AS actor_id, count() AS event_count, groupUniqArray(100)(tuple(timestamp, uuid, `$session_id`, `$window_id`)) AS matching_events
      FROM
        (SELECT e.person_id AS actor_id, toTimeZone(e.timestamp, 'UTC') AS timestamp, e.uuid AS uuid, e.`$session_id` AS `$session_id`, e.`$window_id` AS `$window_id`
         FROM events AS e
         LEFT JOIN
           (SELECT argMax(replaceRegexpAll(nullIf(nullIf(JSONExtractRaw(groups.group_properties, 'industry'), ''), 'null'), '^"|"$', ''), toTimeZone(groups._timestamp, 'UTC')) AS properties___industry, groups.group_type_index AS index, groups.group_key AS key
            FROM groups
            WHERE and(equals(groups.team_id, 2), ifNull(equals(index, 0), 0))
            GROUP BY groups.group_type_index, groups.group_key) AS e__group_0 ON equals(e.`$group_0`, e__group_0.key)
         WHERE and(equals(e.team_id, 2), equals(e.event, 'sign up'), greaterOrEquals(toTimeZone(e.timestamp, 'UTC'), toDateTime64('2020-01-02 00:00:00.000000', 6, 'UTC')), less(toTimeZone(e.timestamp, 'UTC'), toDateTime64('2020-01-03 00:00:00.000000', 6, 'UTC')), ifNull(equals(e__group_0.properties___industry, 'technology'), 0)))
      GROUP BY actor_id SETTINGS use_query_cache=1, query_cache_ttl=600) AS source)))
```
This case of the optimization kicking in seems massive, but seems legit at a glance. I do wonder how fast this is, given the `in()`'s SELECT is triple-nested. 😅 Probably depends on whether the events time range is large (can be months for an aggregate query) or tiny (a single day). (This is not actionable, I'm just wondering.)
This sort of got buried, but I discussed the performance difference with Marius above. I think this will (almost) always make things faster, because if you have enough events for the inner queries to run slowly, you will also have a lot of people for the inefficient join.
If this ends up slowing down small users, we can figure out a way to create a set of "large user optimizations" that only get turned on once people hit some event limit.
This will be much faster once materialized CTEs land in ClickHouse.
I like this a lot! Especially without
Problem
MX trends actor queries are failing: https://posthog.slack.com/archives/C0368RPHLQH/p1718782039244149
They are OOMing because we have too many persons and PDIs (person distinct IDs). MX has 50,000,000 PDI entries and 200,000,000 people, and joining events against them takes more than 36 GB of memory in ClickHouse.
Changes
This adds a new alias for the persons table, called `filterable_persons`. This new table looks into the join's `ON` expression for an `id IN` clause. If it finds one, it brings that clause inside the subselect, to prevent us from having to pull all people.

This behavior is similar to the `WhereClauseExtractor`, but the `WhereClauseExtractor` is general and conservative. `filterable_persons` assumes you know what you're doing in the query, and skips the checks. It would be much too hard to expand the `WhereClauseExtractor` to know that it could bring the entire subselect inside the `WHERE`.

It achieves reuse by putting a `use_query_cache` setting with a 600-second (`HOGQL_INCREASED_MAX_EXECUTION_TIME`) `query_cache_ttl` on the cached value. This means that if the exact same source query is rerun within the next 10 minutes, the results will be the same. We shouldn't be allowing people to run queries that frequently, so it shouldn't be an issue; if it becomes one, we can put some sort of nonce on the CTE to only allow caching within this query.

We also set `optimize_aggregation_in_order` for the subquery with grouping on person_distinct queries. This helps a lot with memory usage in a vacuum (16 GB down to 1 GB), but doesn't matter too much once you join against it.

Unrelated but included:
Followups
`WhereClauseExtractor` — make sure we're using it. This will help inside of event queries where we join against people and then filter on person properties.
How did you test this code?
Wrote tests. Looked at the SQL. Ran a bunch of actors queries with breakdowns and filters locally.
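To illustrate the `filterable_persons` change described above, here is a minimal, hedged sketch of the idea (hypothetical names, plain strings and tuples instead of a real HogQL AST — this is not PostHog's actual implementation): scan a join's `ON` expression for an `id IN (...)` clause and move it inside the persons subselect, so only the matching people are read instead of the whole table.

```python
# Sketch only: `push_in_clause` and the tuple-based clause shape are
# hypothetical, standing in for the real HogQL AST transformation.

def push_in_clause(person_subselect, on_clauses):
    """Pull the first ('in', 'id', <subquery SQL>) clause out of the ON
    expression and into the WHERE of the persons subselect.

    Returns (rewritten_subselect, remaining_on_clauses); if no such
    clause exists, both inputs come back unchanged."""
    pushed = None
    remaining = []
    for clause in on_clauses:
        if pushed is None and clause[:2] == ("in", "id"):
            pushed = clause  # this filter moves into the subselect
        else:
            remaining.append(clause)  # everything else stays in ON
    if pushed is None:
        return person_subselect, on_clauses
    rewritten = f"{person_subselect} WHERE id IN ({pushed[2]})"
    return rewritten, remaining


sub, on = push_in_clause(
    "SELECT id, properties FROM person",
    [
        ("eq", "person.id", "events.person_id"),
        ("in", "id", "SELECT actor_id FROM source"),
    ],
)
print(sub)  # SELECT id, properties FROM person WHERE id IN (SELECT actor_id FROM source)
print(on)   # [('eq', 'person.id', 'events.person_id')]
```

The join-equality clause stays in `ON`, while the `id IN` filter now limits the persons subselect itself, which is what lets ClickHouse avoid materializing all 200M people before the join.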